Abstract

This work presents a systematic study of objective evaluations of abstaining classifications using Information-Theoretic Measures (ITMs). First, we define objective measures as those that do not depend on any free parameter. This definition provides technical simplicity for examining "objectivity" or "subjectivity" directly in classification evaluations. Second, we propose twenty-four normalized ITMs, derived from either mutual information, divergence, or cross-entropy, for investigation. Contrary to conventional performance measures that apply empirical formulas based on users' intuitions or preferences, the ITMs are theoretically more sound for realizing objective evaluations of classifications. We apply them to distinguish "error types" and "reject types" in binary classifications without the need for input data of cost terms. Third, to better understand and select the ITMs, we suggest three desirable features for classification assessment measures, which appear more crucial and appealing from the viewpoint of classification applications. Using these features as "meta-measures", we can reveal the advantages and limitations of ITMs from a higher level of evaluation knowledge. Numerical examples are given to corroborate our claims and compare the differences among the proposed measures. The best measure is selected in terms of the meta-measures, and its specific properties regarding error types and reject types are analytically derived.
1. Introduction



The selection of evaluation measures for classifications has received increasing attention from researchers in various application fields [1][2][3][4][5][6][7]. It is well known that evaluation measures, or criteria, have a substantial impact on the quality of classification performance. The problem of how to select evaluation measures for the overall quality of classifications is difficult, and there appears to be no universal answer.

Up to now, various types of evaluation measures have been used in classification applications. Taking binary classification as an example, more than thirty metrics have been applied for assessing the quality of classifications and their algorithms, as given in Table 1 of Lavesson and Davidsson's paper [5]. Most of the metrics listed in this table can be considered performance-based measures. In practice, other types of evaluation measures, such as Information-Theoretic Measures (ITMs), have also commonly been used in machine learning [8][9]. The typical information-based measure used in classifications is the cross entropy [10]. In a recent work [11], Hu and Wang derived an analytical formula of the Shannon-based mutual information measure with respect to a confusion matrix. Significant benefits were derived from the measure, such as its generality even for classifications with a reject option, and its objectivity in naturally balancing performance-based measures that may conflict with one another (such as precision and recall). The objectivity was achieved from the perspective that an information-based measure does not require knowledge of cost terms in evaluating classifications. This advantage is particularly important in studies of abstaining classifications [12][4] and cost-sensitive learning [13][14], where cost terms may be required as input data for evaluations. Generally, if no cost terms are assigned to evaluations, it implies that zero-one cost functions are applied [15]. In such situations, classification evaluations without a reject option may still be applicable and useful for class-balanced datasets. Problematic, or unreasonable, results will be obtained for evaluations in situations where classes are highly skewed in the datasets [?] if no specific cost terms are given.

In this work, to simplify the discussion, we distinguish, or decouple, two goals in evaluation studies, namely, evaluation of classifiers and evaluation of classifications. The former goal is concerned more with evaluating the algorithms in which classifiers are applied. From this evaluation, designers or users can select the best classifier. The latter goal is to evaluate classification results without regard to which classifier is applied. This evaluation aims more at result comparisons or measure comparisons. One typical example was demonstrated by MacKay [16] for highlighting the difficulty in classification evaluations. He showed two specific confusion matrices, $C_D$ and $C_E$, in binary classifications with a reject option:

$$C_D = [\,\cdots\,], \quad C_E = [\,\cdots\,], \quad \text{with } C = \begin{bmatrix} TN & FP & RN \\ FN & TP & RP \end{bmatrix}, \qquad (1)$$



where the confusion matrix is defined as $C$ in Eq. (1), and "TN", "TP", "FN", "FP", "RN", "RP" represent "true negative", "true positive", "false negative", "false positive", "reject negative", "reject positive", respectively. For the given data, users may ask which measures are proper for ranking them. If directly applying the "True Positive Rate-False Positive Rate" curve (also called ROC) or the "Precision-Recall" curve, one may conclude that the performance of $C_E$ is better than that of $C_D$. This conclusion is proper since the two sets of data share the same reject rate (=11%). Generally, the "Error-Reject" curve is mostly adopted in abstaining classifications. Based on this evaluation approach, one may consider that the performances of the two classifications do not differ, because they show the same error rate (=6%) and reject rate. MacKay [16] first suggested applying a mutual-information based measure in ranking classifications, through which Hu and Wang (referring to M5-M6 in Table 3, [11]) observed that $C_D$ is better than $C_E$. If reviewing the two matrices carefully with respect to imbalanced classes, one may agree with the observation because the small class in $C_D$ receives more correct classifications than that in $C_E$.
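
As a minimal sketch of the point made by this example, the following Python snippet computes the error rate and reject rate from a binary confusion matrix with a reject column. The function name and the two matrices are hypothetical and chosen only for illustration (they are not claimed to be the entries of Eq. (1)); they show how two different classifications can share the same error rate and reject rate, so that an "Error-Reject" comparison alone cannot separate them.

```python
import numpy as np

def error_and_reject_rates(C):
    """Return (error_rate, reject_rate) for a 2x3 matrix [[TN, FP, RN], [FN, TP, RP]]."""
    C = np.asarray(C, dtype=float)
    n = C.sum()
    errors = C[0, 1] + C[1, 0]   # FP + FN
    rejects = C[:, 2].sum()      # RN + RP
    return errors / n, rejects / n

# Hypothetical matrices (illustrative values only).
C_a = [[74, 6, 10], [0, 9, 1]]
C_b = [[78, 6, 6], [0, 5, 5]]

rates_a = error_and_reject_rates(C_a)
rates_b = error_and_reject_rates(C_b)
print(rates_a, rates_b)
```

Both hypothetical matrices give the same pair of rates, although the small class receives a different number of correct classifications in each.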

We consider the example designed by MacKay [16] to be quite stimulating for the study of abstaining classification evaluations. The implications of the example form the motivation of the present work in addressing three related open problems, which are generally overlooked in the study of classification evaluations:

I. How to define "proper" measures in terms of high-level knowledge for abstaining classification evaluations?

II. How to conduct an objective evaluation of classifications without using cost terms?

III. How to distinguish or rank "error types" and "reject types" in classification evaluations?



Conventional binary classifications usually distinguish two types of misclassification errors [15][16] if they result in different losses in applications. For example, in medical applications, a "Type I Error" (or "false positive") can be an error of misclassifying a healthy person as abnormal, such as having cancer. On the contrary, a "Type II Error" (or "false negative") is an error where cancer is not detected in a patient. Therefore, a "Type II Error" is more costly than a "Type I Error". For the same reason that "error types" are identified in binary classifications, there is a need to consider "reject types" if a reject option is applied. Of the existing measures, we consider information-theoretic measures to be most promising in providing "objectivity" in classification evaluations. A detailed discussion on the definition of "objectivity" is given in Section 3. This work is an extension of our previous study [11]. However, the work aims at a systematic investigation of information measures with a specific focus on "error types" and "reject types". The main contribution of the work is derived from the following three aspects:

I. We define the "proper" features, also called "meta-measures", for selecting candidate measures in the context of abstaining classification evaluations. These features will assist users in understanding the advantages and limitations of evaluation measures from a higher level of knowledge.

II. We examine most of the existing information measures in a systematic investigation of "error types" and "reject types" for objective evaluations. We hope that the more than twenty measures investigated are able to enrich the current bank of classification evaluation measures. For the best measure in terms of the meta-measures, we present a theoretical confirmation of its desirable properties regarding error types and reject types.

III. We reveal the intrinsic shortcomings of information measures in evaluations. The discussions are intended to be applicable to a wider range of classification problems, such as similarity ranking. In addition, we are able to employ the measures reasonably in interpreting classification results.

To address classification evaluations with a reject option, we assume that the only basic data available for classification evaluations is a confusion matrix, without input data of cost terms. The rest of this letter is organized as follows. In Section 2, we present related work on the selection of evaluation measures. To seek "proper" measures, we propose several desirable features in the context of classifications in Section 3. Three groups of normalized information measures are proposed along with their intrinsic shortcomings in Sections 4 to 6, respectively. Several numerical examples, together with discussions, are given in Section 7. Finally, in Section 8 we conclude the work.

2. Related Work 

In classification evaluations, a measure based on classification accuracy has traditionally been used with some success in numerous cases [15]. This measure, however, may suffer serious problems in reaching intuitively reasonable results in certain special cases of real-world classification problems [3]. The main reason for this is that a single measure of accuracy does not take error types into account.

To overcome the problems of accuracy measures, researchers have developed many sophisticated approaches for classification assessment [17][18]. Among these, two commonly used approaches are ROC (Receiver Operating Characteristic) curves and AUC (Area Under Curve) measures [1][19]. ROC curves provide users with a very fast evaluation approach via visual inspection, but this is only applicable in limited cases with specific curve forms (for example, when one curve is completely above the other). AUC measures are more generic for ranking classifications without constraints on curve forms. In a study of binary classifications, a formal proof was given by Ling et al. [?] showing that AUC is a better measure than accuracy from the definitions of both statistical consistency and discriminancy. Sophisticated AUC measures were reported recently for improving the robustness [?] and coherency [7] of classifiers. Drummond and Holte [20] proposed a visualization technique called the "Cost Curve", which is able to take cost terms into account for showing confidence intervals on a classifier's performance. Japkowicz [3] presented convincing examples showing the shortcomings of the existing evaluation methods, including accuracy, precision vs. recall, and ROC techniques. The findings from the examples further confirmed the need for methods using measure-based functions [21]. The main idea behind measure-based functions is to form a single function as a weighted summation of multiple measures. The measure function is able to balance a trade-off among conflicting measures, such as precision and recall. However, the main difficulty arises in the selection of balancing weights for the measures [5]. In most cases, users rely on their preferences and experience in assigning the weights, which imposes a strong degree of subjectivity on the evaluation results.

Classification evaluations become more complicated if a classifier abstains from making a prediction when the outcome is considered unreliable for a specific sample. In this case, an extra class, known as the "reject" or "unknown" class, is added to the classification. In recent years, the study of abstaining classifiers has received much attention [22][23][12][4][24]. With complete data of a full cost matrix, these studies were able to assess the classifications. If one term of the cost matrix was missing, such as a reject cost term, the approaches for classification evaluations generally failed. Moreover, because in most situations the cost terms are given by users, this approach is basically a subjective evaluation in applications. Vanderlooy et al. [25] further investigated the ROC isometrics approach, which does not rely on information from a cost matrix. This approach, however, is only applicable to binary classification problems.
A promising study of objective evaluations of classifications is attributed to the introduction of information theory. Kvalseth [26] and Wickens [27] derived normalized mutual information (NMI) measures in relation to a contingency table. Further pioneering studies on classification problems were conducted by Finn [28] and Forbes [29]. Forbes [29] discussed the problem that NMI does not share a monotonic property with the other performance measures, such as accuracy or F-measure. Several different definitions of information measures have been reported in studies of classification assessment, such as information scores by Kononenko and Bratko [30] and KL divergence by Nishii and Tanaka [31]. Yao et al. [8] and Tan et al. [32] summarized many useful information measures for studies of associations and attribute importance. Significant efforts were made in discussing the desired properties of evaluation measures [32]. Principe et al. [9] proposed a framework of information theoretic learning (ITL) that included supervised learning as in classifications. Within this framework, the learning criteria were the mutual information defined from the Shannon and Renyi entropies. Two quadratic divergences, namely, the Euclidean and Cauchy-Schwarz distances, were also included.

From the perspective of information theory, Wang and Hu [33] derived for the first time the nonlinear relations between mutual information and the conventional performance measures (accuracy, recall and precision) for binary classification problems. They [11] extended the investigation to abstaining classification evaluations for multiple classes. Their method was based solely on the confusion matrix. To establish the theoretical properties, they derived the extremum theorems concerning mutual information measures. One of the important findings from the local minimum theorem is the theoretical revelation of the non-monotonic property of mutual information measures with respect to the diagonal terms of a confusion matrix. This property may cause irrational evaluation results for some data in classifications. They confirmed this problem by examining specific numerical examples. Theoretical investigations are still missing for other information measures, such as divergence-based and cross-entropy based ones.

3. Objective Evaluations and Meta-Measures 



This work focuses on objective evaluations of classifications. While Berger [34] stressed four points from a philosophical position for supporting objective Bayesian analysis, it seems that few studies in the literature address the "objectivity" issue in the study of classification evaluations. Some researchers [32] may call their measures objective without defining them formally. Considering that "objectivity" is a rather philosophical concept without a well-accepted definition, we propose a scheme for defining "objective evaluations" from the viewpoint of practical implementation and examination.

Definition 1. Objective evaluations and measures. An objective evaluation is an assessment expressed by a 
function that does not contain any free parameter. This function is called an objective measure. 

Remark 1. When a free parameter is used to define a measure, it usually carries a certain degree of subjectivity 
in evaluations. Therefore, according to this definition, a measure based on cost terms [15] as free parameters does not 
lead to an objective evaluation. Definition 1 may be conservative, but nevertheless, provides technical simplicity for 
examining "objectivity" or "subjectivity" directly with respect to the existence of free parameters. In some situations, Definition 1 can be relaxed by including free parameters, but they all have to be determined solely from the given dataset.

Definition 2. Datasets in classification evaluations with a reject option. A reject option is sometimes considered for classifications in which one may assign samples to a reject or unknown class. Evaluations of classifications with a reject option apply two datasets, namely, the output (or prediction) dataset $\{y_k\}_{k=1}^{n}$, which is a realization of a discrete random variable $Y$ valued on the set $\{1, 2, \ldots, m+1\}$; and the target dataset $\{t_k\}_{k=1}^{n} \in T$ valued on the set $\{1, 2, \ldots, m\}$; where $n$ is the total number of samples, and $m$ is the total number of classes. A sample identified as a reject class is represented by $y_k = m + 1$.

Remark 2. The term "abstaining classifiers" has been widely used in classification problems with a reject option [12][4]. However, most studies of abstaining classifications required cost matrices for their evaluations. The definition given above covers more generic scenarios in classification evaluations, because it does not require knowledge of cost terms for error types and reject types.

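Definition 3. Augmented confusion matrix and its constraints [11]. An augmented confusion matrix includes one column for the reject class, which is added to a conventional confusion matrix:

$$C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1m} & c_{1(m+1)} \\ c_{21} & c_{22} & \cdots & c_{2m} & c_{2(m+1)} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mm} & c_{m(m+1)} \end{bmatrix}, \qquad (2)$$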



where $c_{ij}$ represents the number of samples of the $i$th class that are classified as the $j$th class. The row data correspond to the actual classes, while the column data correspond to the predicted classes. The last column represents the reject class. The relations and constraints of an augmented confusion matrix are:

$$C_i = \sum_{j=1}^{m+1} c_{ij}, \quad C_i > 0, \quad c_{ij} \geq 0, \quad i = 1, 2, \ldots, m, \qquad (3)$$

where $C_i$ is the total number of samples for the $i$th class, which is generally known in classification problems.
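
Definitions 2 and 3 can be illustrated with a short sketch that builds an augmented confusion matrix from target and output labels and checks the constraints of Eq. (3). The function names and the sample labels below are hypothetical and serve only as an illustration under the labeling convention of Definition 2 (classes 1 to m, with m + 1 marking a rejection).

```python
import numpy as np

def augmented_confusion_matrix(t, y, m):
    """Build the m x (m+1) matrix; labels are 1..m, and y == m+1 marks a reject."""
    C = np.zeros((m, m + 1), dtype=int)
    for tk, yk in zip(t, y):
        C[tk - 1, yk - 1] += 1   # row: actual class, column: predicted class
    return C

def check_constraints(C):
    """Check Eq. (3): every class total C_i is positive, all entries are non-negative."""
    C_i = C.sum(axis=1)
    return bool((C_i > 0).all() and (C >= 0).all())

t = [1, 1, 1, 2, 2, 1]   # hypothetical targets
y = [1, 2, 3, 2, 3, 1]   # hypothetical outputs; 3 means "reject" when m = 2
C = augmented_confusion_matrix(t, y, m=2)
print(C, check_constraints(C))
```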

Definition 4. Error types and reject types. Following the conventions in binary classifications [?], we denote $c_{12}$ and $c_{21}$ by "Type I Error" and "Type II Error", respectively; and $c_{13}$ and $c_{23}$ by "Type I Reject" and "Type II Reject", respectively.

Definition 5. Normalized information measure. A normalized information measure, denoted as $NI(T, Y) \in [0, 1]$, is a function based on information theory, which represents the degree of similarity between two random variables $T$ and $Y$.

In principle, we hope that all NI measures satisfy the three important properties, or axioms, of metrics [15][35], supposing $Z$ is another random variable:

P1: $NI(T, Y) = 1$ iff $T = Y$ (the identity axiom)

P2: $NI(T, Y) + NI(Y, Z) \geq NI(T, Z)$ (the triangle inequality)

P3: $NI(T, Y) = NI(Y, T)$ (the symmetry axiom)

Remark 3. Violations of the properties of metrics are possible in reaching reasonable evaluations of classifications. For example, the triangle inequality and symmetry properties can be relaxed without changing the ranking orders among classifications if their evaluation measures are applied consistently. However, the identity property is indicated only for the relation $T = Y$ (assuming $T$ is padded with zeros to make it the same size as $Y$), and does not guarantee an exact solution ($t_k = y_k$) in classifications (see Theorems 1 and 4 given later). If a violation of metric properties occurs, the NIs are referred to as measures, rather than metrics.

For classification evaluations, we consider the generic properties of metrics not to be as crucial in comparisons as 
certain specific features. In this work, we focus on specific features that, though not mathematically fundamental, are 
more necessary in classification applications. To select "better" measures for objective evaluations of classifications, 
we propose the following three desirable features together with their heuristic reasons. 

Feature 1. Monotonicity with respect to the diagonal terms of the confusion matrix. The diagonal terms of the confusion matrix represent the numbers of exact classifications for all the samples. Or, they reflect the numbers of coincidences between $t$ and $y$ from a similarity viewpoint. When one of these terms changes, the evaluation measure should change monotonically. Otherwise, a non-monotonic measure may fail to provide a rational result for ranking classifications correctly. This feature was originally proposed for describing the strength of agreement (or similarity) if the matrix is a contingency table [32].

Feature 2. Variation with reject rate. To improve classification performance, a reject option is often used in engineering applications [12]. Therefore, we suggest that a measure should be a scalar function of both classification accuracy and reject rate. Such a measure could be evaluated based solely on a given confusion matrix from a single operating point in the classification. This is different from the AUC measures that are based on an "Error-Reject" curve [16][24] from multiple operating points.



Feature 3. Intuitively consistent costs among error types and reject types. This feature is derived from the 
principle of our conventional intuitions when dealing with error types in classifications. It is also extended to reject 
types. Two specific intuitions are adopted for binary classifications. First, a misclassification or rejection from a small 
class will cause a greater cost than that from a large class. This intuition represents a property called "within error 
types and reject types". Second, a misclassification will produce a greater cost than a rejection from the same class, which is called the "between error and reject types" property. If a measure is able to satisfy these intuitions, we refer to its associated costs as being "intuitively consistent". Exceptions to the intuitions above may exist, but we consider them to be very special cases.

At this stage, it is worth discussing "objectivity" in evaluations, because one may doubt the correctness of the intentions above and the terms "desirable" or "intuitions" in a study of objective evaluations. The three features seem to be "problematic" in terms of providing a general concept of "objectivity", because no human bias should be applied in the objective judgment of evaluation results. The following discussion justifies the proposal of requiring desirable, or proper, features for objective measures. On the one hand, we recognize that any evaluation will imply a certain degree of "subjectivity", since evaluations exist only as a result of human judgment. For example, every selection of evaluation measures, even of objective ones, will rely on possible sources of "subjectivity" from users. On the other hand, engineering applications are concerned with objective evaluations [29][32]. However, to the best of the authors' knowledge, a technical definition, or criterion, seems to be missing for determining objective or subjective measures in evaluations of classifications. To overcome possible confusion and vagueness, we set Definition 1 as a practical criterion for examining whether or not a classification evaluation holds "objectivity". If a measure satisfies this definition, it will always retain the property of "objective consistency" in evaluating the given classification results. The three "desirable" features, though based on "intuitions" with "subjectivity", do not destroy the criterion of "objectivity" in classification evaluations. Therefore, it is logically correct to discuss "desirable" features of objective measures as long as the measures satisfy Definition 1 for keeping the defined "objectivity".

Note that all desirable features above are derived from our intuitions on general cases of classification evaluations. Other items may be derived for a wider examination of features. For example, Forbes [29] proposed six "constraints on proper comparative measures", namely, "statistically principled, readily interpretable, generalizable to k-class situations, not different to the special status, reflective of agreement, and insensitive to the segmentation". However, we consider the three features proposed in this work to be more crucial, especially as Feature 3 has never been considered in previous studies of classification evaluations. Although Features 2 and 3 may share a similar meaning, they are presented individually to highlight their specific concerns. We can also call the desirable features "meta-measures", since these are defined to be qualitative and high-level measures about measures. In this work, we apply meta-measures in our investigation of information measures. The examination with respect to the meta-measures enables clarification of the causes of performance differences among the examined measures in classification evaluations. It will be helpful for users to understand the advantages and limitations of different measures, either objective or subjective ones, from a higher level of evaluation knowledge.

4. Normalized Information Measures based on Mutual Information 

All NI measures applied in this work are divided into three groups, namely, mutual-information based, divergence based, and cross-entropy based groups. In this section, we focus on the first group. Each measure in this group is derived directly from mutual information representing the degree of similarity between two random variables. For the purpose of objective evaluations, as suggested by Definition 1 in the previous section, we eliminate all candidate measures defined from the Renyi or Jensen entropies [36][9], since they involve a free parameter. Therefore, without adding free parameters, we only apply the Shannon entropy to information measures [37]:

$$H(Y) = -\sum_{y} p(y) \log_2 p(y), \qquad (4)$$

where $Y$ is a discrete random variable with probability mass function $p(y)$. Then mutual information is defined as [37]:

$$I(T, Y) = \sum_{t} \sum_{y} p(t, y) \log_2 \frac{p(t, y)}{p(t)\,p(y)}, \qquad (5)$$

where $p(t, y)$ is the joint distribution for the two discrete random variables $T$ and $Y$, and $p(t)$ and $p(y)$ are called marginal distributions that can be derived from:

$$p(t) = \sum_{y} p(t, y), \quad p(y) = \sum_{t} p(t, y). \qquad (6)$$

Sometimes, the simplified notation $p_{ij} = p(t, y) = p(t = t_i, y = y_j)$ is used in this work. Table 1 lists the possible normalized information measures within the mutual-information based group. Basically, they all make use of Eq. (5) in their calculations. The main differences are due to the normalization schemes. In applying the formulas for calculating $NI_k$, one generally does not have an exact $p(t, y)$. For this reason, we adopt an empirical joint distribution defined below for the calculations.



Definition 6. Empirical joint distribution and empirical marginal distributions [11]. An empirical joint distribution is defined from the frequency means for the given confusion matrix, $C$, as:

$$P_e(t, y) = (p_{ij})_e = \frac{c_{ij}}{n}, \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, m+1, \qquad (7a)$$

where $n = \sum_{i} C_i$ denotes the total number of samples in the classifications. The subscript "e" denotes empirical terms. The empirical marginal distributions are:

$$p_e(t = t_i) = \frac{C_i}{n}, \quad i = 1, 2, \ldots, m, \qquad (7b)$$

$$p_e(y = y_j) = \frac{1}{n} \sum_{i=1}^{m} c_{ij}, \quad j = 1, 2, \ldots, m+1. \qquad (7c)$$

Definition 7. Empirical mutual information [11]. The empirical mutual information is given by:

$$I_e(T, Y) = \sum_{i=1}^{m} \sum_{j=1}^{m+1} \frac{c_{ij}}{n} \log_2 \frac{c_{ij}/n}{(C_i/n)\left(\sum_{k=1}^{m} c_{kj}/n\right)}. \qquad (8)$$
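
The empirical quantities of Definitions 6 and 7 translate directly into a few lines of code. The following is a minimal sketch, assuming the confusion matrix is given as an m x (m+1) array and adopting the usual convention that terms with a zero joint probability contribute nothing; the function name is hypothetical.

```python
import numpy as np

def empirical_mutual_information(C):
    """Mutual information in bits from an m x (m+1) confusion matrix, Eqs. (7a)-(8)."""
    C = np.asarray(C, dtype=float)
    n = C.sum()
    p = C / n                            # empirical joint distribution, Eq. (7a)
    pt = p.sum(axis=1, keepdims=True)    # empirical target marginal, Eq. (7b)
    py = p.sum(axis=0, keepdims=True)    # empirical output marginal, Eq. (7c)
    mask = p > 0                         # convention: 0 * log 0 = 0
    ratio = p[mask] / (pt @ py)[mask]
    return float((p[mask] * np.log2(ratio)).sum())

# Hypothetical balanced binary case with no errors and no rejections.
I_exact = empirical_mutual_information([[50, 0, 0], [0, 50, 0]])
print(I_exact)
```

For this hypothetical exact classification of two balanced classes, the mutual information equals the entropy of the targets, that is, one bit.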

Definitions 6 and 7 provide users with a direct means for applying information measures through the given data of the confusion matrix. For the sake of simplicity of analysis and discussion, we adopt the empirical distributions, or $p_{ij} \approx (p_{ij})_e$, for calculating all NIs and deriving the theorems, but remove their associated subscript "e". Note that the notation of $NI_2$ in Table 1 differs from the others for calculating mutual information, where $I_M(T, Y)$ is defined as "modified mutual information". The calculation of $I_M(T, Y)$ is carried out based on the intersection of $T$ and $Y$. Hence, when using Eq. (8), the intersection requires that $I_M(T, Y)$ incorporate a summation of $j$ over 1 to $m$, instead of $m + 1$. This definition lacks mathematical rigor, but $NI_2$ has the same properties of metrics as $NI_1$. It was originally proposed to overcome the problem of unchanging values in NIs if rejections are made within only one class (referring to M9-M10 in Table 3, [11]). The following three theorems are derived for all NIs in this group.

Theorem 1. For all NI measures in Table 1, when $NI(T, Y) = 1$, the classification without a reject class may correspond to the case of either an exact classification ($y_k = t_k$), or a specific misclassification ($y_k \neq t_k$). The specific misclassification can be fully removed by simply exchanging labels in the confusion matrix.

Proof. If $NI(T, Y) = 1$, we can obtain the following conditions from Eq. (8) for classifications without a reject class:

$$p_{ij} = p(t = t_i) \approx p_e(t = t_i) = \frac{C_i}{n} \quad \text{and} \quad p_{kj} = 0, \quad i, j, k = 1, 2, \ldots, m, \quad k \neq i. \qquad (9)$$

These conditions describe the specific confusion matrix where only one non-zero term appears in each column (with the exception of the last $(m+1)$th column). When $j = i$, $C$ is a diagonal matrix representing an exact classification ($y_k = t_k$). Otherwise, a specific misclassification exists for which a diagonal matrix can be obtained by exchanging labels in the confusion matrix (referring to M11 in Table 4, [11]). ♦

Remark 4. Theorem 1 describes that NI(T,Y)=1 presents a necessary, but not sufficient, condition of an exact 
classification. 
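
Theorem 1 can be checked numerically with a short sketch. The normalization $NI = I(T, Y)/H(T)$ used here is an assumption made for illustration, since Table 1 is not reproduced in this section; the helper names and the two matrices are likewise hypothetical. Mutual information is computed through the identity $I(T, Y) = H(T) + H(Y) - H(T, Y)$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(C):
    """I(T,Y) = H(T) + H(Y) - H(T,Y) from the empirical joint distribution."""
    p = np.asarray(C, dtype=float)
    p = p / p.sum()
    return entropy(p.sum(axis=1)) + entropy(p.sum(axis=0)) - entropy(p.ravel())

def ni(C):
    """Assumed normalization for this sketch: NI = I(T,Y) / H(T)."""
    C = np.asarray(C, dtype=float)
    return mutual_information(C) / entropy(C.sum(axis=1) / C.sum())

exact = [[30, 0, 0], [0, 70, 0]]     # every sample classified correctly
swapped = [[0, 30, 0], [70, 0, 0]]   # every sample misclassified (labels exchanged)
print(ni(exact), ni(swapped))
```

Under this assumed normalization, both matrices reach the maximum value, which is why $NI = 1$ alone cannot confirm an exact classification.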

Theorem 2. For abstaining classification problems, when $NI(T, Y) = 0$, the classifier generally reflects a misclassification. One special case is that all samples are considered to be one of the $m$ classes, or to be the reject class.

Proof. For NIs defined in Table 1, $NI(T, Y) = 0$ iff $I(T, Y) = 0$. According to information theory [?], the following conditions hold based on the given marginal distributions (or the empirical ones if a confusion matrix is used):

$$I(T, Y) = 0, \quad \text{iff} \quad p(t, y) = p(t)\,p(y). \qquad (10)$$

The conditional part in Eq. (10) can be rewritten in the form $p_{ij} = p(t = t_i)\,p(y = y_j)$. From the constraints in (3), $p(t = t_i) > 0$ ($i = 1, 2, \ldots, m$) can be obtained. For classification solutions, there should exist at least one term for $p(y = y_j) > 0$ ($j = 1, 2, \ldots, m+1$). Therefore, at least one non-zero term for $p_{ij} > 0$ ($i \neq j$) must be obtained. This non-zero term corresponds to an off-diagonal term in the confusion matrix, which indicates that a misclassification has occurred. When all samples have been identified as one of the classes (referring to M2 in Table 4, [11]), $NI = 0$ should be obtained. ♦
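
The condition in Eq. (10) can be illustrated with a small sketch. The matrices below are hypothetical; each has an empirical joint distribution that factorizes into its marginals, so the mutual information vanishes. The helper names are assumptions made for this example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability terms."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(C):
    """I(T,Y) = H(T) + H(Y) - H(T,Y) from the empirical joint distribution."""
    p = np.asarray(C, dtype=float)
    p = p / p.sum()
    return entropy(p.sum(axis=1)) + entropy(p.sum(axis=0)) - entropy(p.ravel())

all_one_class = [[0, 30, 0], [0, 70, 0]]    # every sample assigned to class 2
all_rejected = [[0, 0, 30], [0, 0, 70]]     # every sample rejected
proportional = [[10, 20, 0], [20, 40, 0]]   # rows proportional: p(t,y) = p(t) p(y)

mi_values = [mutual_information(C) for C in (all_one_class, all_rejected, proportional)]
print(mi_values)
```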

Remark 5. Eq. (10) gives the statistical reason for zero mutual information, that is, the two random variables are "statistically independent". Theorem 2 demonstrates an intrinsic reason for local minima in NIs.
Theorem 3. The NI measures defined by the Shannon entropy generally do not exhibit a monotonic property with respect to the diagonal terms of a confusion matrix.

Proof. Based on [11], we arrive at simpler conditions for the local minima of $I(T, Y)$ for the given confusion matrix:

$$C = \begin{bmatrix} \ddots & & & \\ & c_{i,i} & c_{i,i+1} & \\ & c_{i+1,i} & c_{i+1,i+1} & \\ & & & \ddots \end{bmatrix}, \quad \text{if} \quad \frac{c_{i,i}}{c_{i+1,i}} = \frac{c_{i,i+1}}{c_{i+1,i+1}}. \qquad (11)$$

The local minima are obtained because the four given non-zero terms in Eq. (11) produce zero (or the minimum) contribution to $I(T, Y)$. Suppose a generic form is given for $NI(T, Y) = g(I(T, Y))$, where $g(\cdot)$ is a normalization function. From the chain rule of derivatives, it can be seen that the conditions for reaching the local minima do not change. ♦

Remark 6. The non-monotonic property of the information measures implies that these measures may suffer
from an intrinsic problem of local minima for classification rankings (referring to M19-M20 in Table 4, [11]). Or,
according to Feature 1 of the meta-measures, a rational result for the classification evaluations may not be obtained 
due to the non-monotonic property of the measures. This shortcoming has not been theoretically derived in previous 
studies (Qllliil). 



5. Normalized Information Measures based on Information Divergence 

In this section, we propose normalized information measures based on the definition of information divergence. In Table 2, we summarize the commonly used divergence measures, which are denoted as $D_k(T, Y)$ and represent the dissimilarity between the two random variables T and Y. In Sections 5 and 6, we apply the following notations for defining marginal distributions:

$$p_t(z) = p_t(t = z) = p(t), \quad \text{and} \quad p_y(z) = p_y(y = z) = p(y), \tag{12}$$

where z is a possible scalar value that t or y can take. For a consistent comparison with the previous normalized information measures, we adopt the following transformation on $D_k$ [31]:

$$NI_k = \exp(-D_k). \tag{13}$$

This transformation provides both inverse and normalization functionalities. It does not introduce any extra parameters and keeps the derivations simple, for example when examining the local minima. Two more theorems are derived by following an analysis similar to that in the previous section.

Theorem 4. For all NI measures in Table 2, when NI(T, Y) = 1, the classifier corresponds to the case of either an exact classification or a specific misclassification. Generally, the misclassification in the latter case cannot be removed by switching labels in the confusion matrix.

Proof. When $p_y(z) = p_t(z)$, it is always the case that NI(T, Y) = 1. However, general conditions can be given for $p_y(z) = p_t(z)$ as follows:

$$p_y(y = z_i) = p_t(t = z_i), \quad \text{or} \quad \sum_j p_{ji} = \sum_j p_{ij}, \quad i = 1, 2, \ldots, m. \tag{14}$$

Eq. (14) implies two cases of classifications for $D_k(T, Y) = 0$, $k = 10, \ldots, 20$. One of these corresponds to an exact classification (or $y_k = t_k$), while the other is the result of a specific misclassification that shows the relationship $y_k \neq t_k$, but $p_y(z) = p_t(z)$. In the latter case, switching labels in the confusion matrix to remove the misclassification generally destroys the relation $p_y(z) = p_t(z)$ at the same time. Since this relation is a necessary condition for a perfect classification, the misclassification cannot be removed through a label switching operation. ♦

Remark 7. Theorem 4 suggests that caution should be applied in interpreting the classification evaluations when NI(T, Y) = 1. The maximum of the NIs from the information divergence measures only indicates the equivalence between the marginal probabilities, $p_y(z) = p_t(z)$, which does not always represent an exact classification (or $y_k = t_k$). Theorem 4 reveals an intrinsic problem when using an NI as a measure for similarity evaluations between two datasets, such as in image registration.

Theorem 5. The NI measures based on information divergence generally do not exhibit a monotonic property with respect to the diagonal terms of a confusion matrix.

Proof. The theorem can be proved by examining the existence of multiple maxima for NI measures based on information divergence. Here we use a binary classification as an example. The local minima of $D_k$ are obtained when the following conditions hold for a confusion matrix:

$$C = \begin{bmatrix} C_1 - d_1 & d_1 \\ d_2 & C_2 - d_2 \end{bmatrix} \quad \text{and} \quad d_1 = d_2, \tag{15}$$

where $d_1$ and $d_2$ are integers (> 0) for misclassified samples. The confusion matrix in Eq. (15) produces zero divergence $D_k$ and therefore $NI_k = 1$. However, changing to $d_1 \neq d_2$ always results in $NI_k < 1$. ♦

Remark 8. Theorem 5 indicates another shortcoming of NIs in the information divergence group from the viewpoint of monotonicity. The reason is once again attributed to the use of marginal distributions in the calculation of divergence. The shortcoming has not been reported in previous investigations ([31], [35]).
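
The effect described by Theorems 4 and 5 can be reproduced with a short script. This is a sketch under stated assumptions: the function names are hypothetical, only the variation distance ($D_{16}$) and the KL divergence ($D_{12}$) of Table 2 are implemented, and M6 and M1 are taken from Table 4 with zeros assumed for the entries not shown.

```python
import math

def marginals(conf):
    """Return (p_t, p_y) on a common index set: the m classes, then the reject class."""
    n = sum(sum(row) for row in conf)
    width = len(conf[0])
    p_t = [sum(row) / n for row in conf] + [0.0] * (width - len(conf))
    p_y = [sum(col) / n for col in zip(*conf)]
    return p_t, p_y

def ni_variation(conf):
    """NI_16 = exp(-V), with V the variation distance between the marginals."""
    p_t, p_y = marginals(conf)
    return math.exp(-sum(abs(a - b) for a, b in zip(p_t, p_y)))

def ni_kl(conf):
    """NI_12 = exp(-KL(T, Y)), with KL in bits; returns 0.0 if KL is infinite."""
    p_t, p_y = marginals(conf)
    d = 0.0
    for a, b in zip(p_t, p_y):
        if a > 0:
            if b == 0:
                return 0.0
            d += a * math.log2(a / b)
    return math.exp(-d)

m6 = [[89, 1, 0], [1, 9, 0]]   # two errors, but identical marginals (d_1 = d_2 = 1)
m1 = [[90, 0, 0], [1, 9, 0]]   # one error, marginals differ

print(ni_variation(m6), ni_kl(m6))
print(round(ni_variation(m1), 4), round(ni_kl(m1), 4))
```

M6 contains two misclassified samples yet reaches the maximum value, while M1 with a single error scores lower, because both measures see only the marginal distributions.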



6. Normalized Information Measures based on Cross-Entropy 

In this section, we propose normalized information measures based on cross-entropy, which is defined for discrete random variables as [10]:

$$H(T; Y) = -\sum_z p_t(z) \log_2 p_y(z), \quad \text{or} \quad H(Y; T) = -\sum_z p_y(z) \log_2 p_t(z). \tag{16}$$

Note that H(T; Y) differs from the joint entropy H(T, Y) with respect to both notation and definition; the latter is given as [37]:

$$H(T, Y) = -\sum_t \sum_y p(t, y) \log_2 p(t, y). \tag{17}$$

In fact, from Eq. (16), one can derive the relation between KL divergence (see Table 2) and cross-entropy:

$$H(T; Y) = H(T) + KL(T, Y), \quad \text{or} \quad H(Y; T) = H(Y) + KL(Y, T). \tag{18}$$

If H(T) is considered as a constant in classification since the target dataset is generally known and fixed, we can observe from Eq. (18) that cross-entropy shares a similar meaning with KL divergence for representing dissimilarity between T and Y. From the conditions $H \geq 0$ and $KL \geq 0$, we are able to realize the normalization for cross-entropy shown in Table 3. Following discussions similar to those in the previous section, we can derive that all information measures listed in Table 3 also satisfy Theorems 4 and 5.
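
The relation in Eq. (18) and the first two cross-entropy measures can be illustrated as follows. This is only a sketch: the function names are hypothetical, the forms $NI_{21} = H(T)/H(T;Y)$ and $NI_{22} = H(Y)/H(Y;T)$ are the reading of Table 3 assumed here, and the marginals are those of M3 in Table 4 (one rejected sample from the small class).

```python
import math

def entropy(p):
    """Shannon entropy in bits."""
    return -sum(a * math.log2(a) for a in p if a > 0)

def kl(p, q):
    """KL(P, Q) in bits; math.inf if q(z) = 0 where p(z) > 0."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log2(a / b)
    return total

def cross_entropy(p, q):
    """H(P; Q) = -sum_z p(z) log2 q(z), as in Eq. (16)."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total -= a * math.log2(b)
    return total

# Marginals over (class 1, class 2, reject), assumed from M3 in Table 4
p_t = [0.90, 0.10, 0.00]
p_y = [0.90, 0.09, 0.01]

ni21 = entropy(p_t) / cross_entropy(p_t, p_y)
ni22 = entropy(p_y) / cross_entropy(p_y, p_t)   # H(Y; T) is infinite here

print(round(ni21, 3), ni22)
```

Because the target distribution assigns zero probability to the reject class, H(Y; T) diverges as soon as one sample is rejected, which drives the second measure to zero.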

7. Numerical Examples and Discussions 

This section presents several numerical examples together with associated discussions. All calculations for the numerical examples were done using the open-source software Scilab (footnote 1) and a specific toolbox (footnote 2). The detailed implementation of this toolbox is described in [38]. Table 4 lists six numerical examples in binary classification problems according to the specific scenarios of their confusion matrices. We adopt the notations from [39] for the terms "correct recognition rate (CR)", "error rate (E)", and "reject rate (Rej)" and their relation:

$$CR + E + Rej = 1. \tag{19}$$

In addition, we define "accuracy rate (A)" as

$$A = \frac{CR}{CR + E}. \tag{20}$$

Footnote 1: http://www.scilab.org
Footnote 2: The toolbox is freely available as the file "confmatrix2ni.zip" at http://www.openpr.org.cn.

The first four classifications (or models), M1-M4, are provided to show the specific differences with respect to error types and reject types. In this work, we are not concerned with the classifiers applied (say, neural networks or support vector machines), but only with the resulting evaluations from these classifiers. In real applications, it is common to encounter the ranking of classification results such as M1 to M4. The first two classifications, M1 and M2, share the same values for the correct recognition and accuracy rates (CR = A = 99%). The other two classifications, M3 and M4, have the same reject rates (Rej = 1%) and correct recognition rates (CR = 99%). The accuracy rates for M3 and M4 are also the same (A = 100%). This definition is consistent with the conventions in the study of "Accuracy-Reject" curves [16]. Neglecting the specific application backgrounds, users generally have a ranking order for the four classifications so that the "best" one is selected. The data from other conventional measures, such as Precision, Recall and F1, are also given in Table 4. Without extra knowledge about the cost of different error types or reject types, the conventional performance measures cannot rank the four classifications, M1-M4, properly.
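
The rates in Eqs. (19) and (20) follow directly from the counts in a confusion matrix. The sketch below uses a hypothetical helper `rates` and the four matrices M1-M4 as read from Table 4 (rows are true classes, the last column is the reject class, and zeros are assumed for entries not shown).

```python
def rates(conf):
    """Return (CR, E, Rej, A) for an m-class confusion matrix with a trailing reject column."""
    m = len(conf)
    n = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(m))
    rejected = sum(sum(row[m:]) for row in conf)
    errors = n - correct - rejected
    accuracy = correct / (correct + errors)   # A = CR / (CR + E), Eq. (20)
    return correct / n, errors / n, rejected / n, accuracy

models = {
    "M1": [[90, 0, 0], [1, 9, 0]],    # error in the small class
    "M2": [[89, 1, 0], [0, 10, 0]],   # error in the large class
    "M3": [[90, 0, 0], [0, 9, 1]],    # rejection in the small class
    "M4": [[89, 0, 1], [0, 10, 0]],   # rejection in the large class
}

for name, conf in models.items():
    cr, e, rej, a = rates(conf)
    print(f"{name}: CR={cr:.3f} E={e:.3f} Rej={rej:.3f} A={a:.3f}")
```

M1 and M2 receive identical rates, as do M3 and M4, so these rates alone cannot separate the two error types or the two reject types.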

According to the intuitions of Feature 3 proposed in Section 3, one can obtain two sets of ranking orders for the four classifications M1 to M4 in the forms of:

$$\mathcal{R}(M2) > \mathcal{R}(M1), \quad \mathcal{R}(M4) > \mathcal{R}(M3), \tag{21-a}$$

$$\mathcal{R}(M4) > \mathcal{R}(M2), \quad \mathcal{R}(M3) > \mathcal{R}(M1), \tag{21-b}$$

where we denote $\mathcal{R}(\cdot)$ to be a ranking operator, so that $\mathcal{R}(M_i) > \mathcal{R}(M_j)$ expresses that $M_i$ is better than $M_j$ in ranking. From Eq. (21), one is unable to tell the ranking order between M2 and M3. For a fast comparison, a specific letter is assigned to the ranking order of each model in Table 4 based on Eq. (21):

$$\mathcal{R}(M4) = A, \quad \mathcal{R}(M3) = B, \quad \mathcal{R}(M2) = B, \quad \mathcal{R}(M1) = C. \tag{22}$$

The top rank "A" indicates the "best" classification (M4 in this case) of the four models. Table 4 does not distinguish the ranking order between M2 and M3. However, numerical investigations using information measures will provide the ranking order from the given data. The other two models, M5 and M6, are specifically designed for the purpose of examining information measures on Theorems 3 and 5 (or Feature 1), respectively.

Tables 5 and 6 present the results on information measures for M1-M6, where the ranking orders among M1-M4 are based on the calculation results of NIs with the given digits. The following observations are drawn from the solutions to the examples.

1) When normalization functions include the term H(Y) for the mutual information group, the associated NI produces the desirable feature of a variation in reject rate. $NI_2$ is effective for this feature even though it only uses H(T) for its normalization. The effectiveness is attributed to the definition of $I_M(T, Y)$ for calculating mutual information based on the intersection of T and Y.

2) The results of M5 and M6 confirm, respectively, Theorem 3 for local minima and Theorem 5 for maxima of NIs. The existence of multiple extrema indicates the non-monotonic property with respect to the diagonal terms of the confusion matrix, thereby exhibiting an intrinsic shortcoming of the information measures.

3) For classifications M1 to M4, the meta-measure of Feature 3 suggests ranking orders as shown in Eqs. (21) or (22). However, of all the measures in the three groups, only $NI_2$ shows consistency with the intuitions from the given examples (Tables 5 and 6). This result indicates that Feature 3 seems to be a difficult property for most information measures.

4) None of the performance or information measures investigated in this work fully satisfies the meta-measures. Examining data distinguishability in M1 through M4, we consider the information measures from the mutual-information group to be more appropriate than those of the other groups (say, $NI_{12}$ and $NI_{22}$ do not show significant distinguishability, or value differences, among the four models).

The fourth observation supports the proposal of meta-measures for a higher level of classification evaluations. The meta-measures provide users with a simple guideline for selecting "proper" measures according to their specific application concerns. For example, the performance measures (such as A, E, CR, F1, or AUC) satisfy Feature 1, but fail to directly distinguish error types and reject types in an objective evaluation. When Feature 2 or 3 is a main concern, the information measures prove to be more effective, despite not being perfect.

Of all the information measures investigated, $NI_2$ is shown to be the "best" for the given examples in terms of Feature 3. Therefore, more detailed studies, both theoretical and numerical, were made on this promising measure. The theoretical properties of this measure are derived in Appendix A. While Theorem A1 confirms that $NI_2$ satisfies Feature 3 around the exact classifications, Theorem A2 indicates that this measure is able to adjust the ranking order between a misclassification of a large class and a rejection of a small class. Table 7 shows two sets of confusion matrices which are similar to M1-M4 in Table 4. One can observe the changes of ranking orders among them. These changes numerically confirm Theorem A2 and its critical point, or cross-over point ($Q = C_1/n \approx 0.942$), for the given data.
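
The change of ranking order can be reproduced with a short script. This is a sketch rather than the toolbox implementation: `ni2` is a hypothetical helper that follows the modified mutual information as read from Appendix A (the reject column is left out of the sum, while class totals and n include rejected samples), and the matrices are those of Table 7 with zeros assumed for the entries not shown.

```python
import math

def ni2(conf):
    """NI_2 = I_M(T, Y) / H(T) for m classes plus one trailing reject column."""
    m = len(conf)
    n = sum(sum(row) for row in conf)
    row_tot = [sum(row) for row in conf]                        # C_i, rejects included
    col_tot = [sum(row[j] for row in conf) for j in range(m)]   # decisions per class
    i_m = 0.0
    for i in range(m):
        for j in range(m):            # the reject column is left out of the sum
            c = conf[i][j]
            if c > 0:
                i_m += (c / n) * math.log2(n * c / (row_tot[i] * col_tot[j]))
    h_t = -sum((r / n) * math.log2(r / n) for r in row_tot if r > 0)
    return i_m / h_t

table7 = {
    "M1a": [[94, 0, 0], [1, 5, 0]], "M2a": [[93, 1, 0], [0, 6, 0]],
    "M3a": [[94, 0, 0], [0, 5, 1]], "M4a": [[93, 0, 1], [0, 6, 0]],
    "M1b": [[95, 0, 0], [1, 4, 0]], "M2b": [[94, 1, 0], [0, 5, 0]],
    "M3b": [[95, 0, 0], [0, 4, 1]], "M4b": [[94, 0, 1], [0, 5, 0]],
}

for name, conf in table7.items():
    print(name, round(ni2(conf), 3))
```

With $C_1 = 94$ the rejection in the small class (M3a) ranks above the error in the large class (M2a); with $C_1 = 95$ the order of the corresponding models is reversed, which places the cross-over point between the two settings.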

Further investigations were carried out on three-class problems. Although some NIs could be removed directly based on their poor performance with respect to the meta-measures (such as $NI_1$ and $NI_9$ on Feature 2), they were retained to demonstrate pros and cons in the applications. At this stage, we extend the concepts of error types and reject types to multiple classes. Nine examples are specifically designed in Table 8. The ranking order for each model is shown in Table 8, which is derived from the intuitions of Feature 3. From Tables 9 and 10, it is interesting to see that $NI_2$ is still the most appropriate measure for classification evaluations. Using this measure, we can select the "best" and "worst" classifications consistent with our intuition. All other measures fall short of our expectations for distinguishing error types and reject types properly.

The numerical study supports the viewpoint that no universally superior measure exists. For example, in comparison with the information measure $NI_2$, the conventional accuracy measure satisfies Feature 1, but does not satisfy Feature 3. Thus, any measure, either performance-based or information-based, should be designed and evaluated within the context of the specific applications. It is evident that the desirable features in the specific applications become more crucial (or "proper") for evaluation measures than some generic mathematical properties. For example, information measures (such as KL divergence) that may not satisfy a metric's properties (say, symmetry) are able to process classification evaluations including a reject option. They are more widely applicable than the conventional performance measures in abstaining classifications. However, we still need a complete picture of information measures with respect to their advantages as well as limitations. The examples in Tables 4, 7, and 8 only present limited scenarios for variations in confusion matrices. Using the open-source toolbox from [38], one is able to test more scenarios for numerical investigations.

8. Summary 

In this work, we investigated objective evaluations of classifications by introducing normalized information measures. We reviewed the related works and discussed objectivity and its formal definition in evaluations. Objective evaluations may be required under different application backgrounds. In classifications, for example, exact knowledge of misclassification costs is sometimes unavailable in evaluations. Moreover, ignorance regarding reject costs appears more often in scenarios of abstaining classifications. In these cases, although subjective evaluations can be applied, the user-given data for the unknown abstention costs will lead to a much higher degree of uncertainty or inconsistency. We believe that an objective evaluation can be a suitable solution, as well as a complementary approach, to subjective evaluations. In some situations, an objective evaluation is considered useful even when the subjective evaluations are reasonable for the applications. The results from both objective and subjective evaluations give users an overall picture of the quality of classification results.

Considering that abstaining classifications are becoming more popular, we focused on the distinctions of error types and reject types within objective evaluations of classifications. First, we proposed three meta-measures for assessing classifications, which seem more relevant and proper than the properties of metrics in the context of classification applications. The meta-measures provide users with useful guidelines for a quick selection of candidate measures. Second, we tried systematically to enrich a classification evaluation bank by including commonly used information measures. Contrary to the conventional performance measures that apply empirical formulas, the information measures are theoretically more sound for objective evaluations of classifications. The key advantage of these measures over the conventional ones is their ability to handle multi-class classification evaluations with a reject option. Third, we revealed theoretically the intrinsic shortcomings of the information measures. These have not been formally reported before in studies of image registration, feature selection, or similarity ranking. The discovery of these shortcomings is very important for users to interpret their results correctly when applying those measures.

Based on the principle of the "No Free Lunch Theorem" [15], we recognize that there are no "universally superior" measures [5]. It is not our aim to replace the conventional performance measures, but to explore information measures systematically in classification evaluations. The theoretical study demonstrates the strengths and weaknesses of the information measures. Numerical investigations, conducted on binary and three-class classifications, confirmed that objective evaluation is not an easy topic in the study of machine learning. One of the most challenging tasks will be the exploration of novel measures that satisfy all meta-measures as well as the metric properties in objective evaluations of classifications. It is also necessary to define the "ranking order" intuitions among error types and reject types in generic classifications, which will form the basis of quantitative meta-measures. However, this task becomes more difficult when classifications go beyond two classes.
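
Appendix A. Theorems and Sensitivity Functions of $NI_2$ for Binary Classifications

Theorem A1: For a binary classification defined by:

$$C = \begin{bmatrix} TN & FP & RN \\ FN & TP & RP \end{bmatrix}, \tag{A1-a}$$

and

$$C_1 = TN + FP + RN, \quad C_2 = FN + TP + RP, \quad C_1 + C_2 = n, \tag{A1-b}$$

$NI_2$ satisfies Feature 3 on the property regarding error types and reject types around the exact classifications. Specifically, for the four confusion matrices below:

$$M_1 = \begin{bmatrix} C_1 & 0 & 0 \\ d & C_2 - d & 0 \end{bmatrix}, \quad M_2 = \begin{bmatrix} C_1 - d & d & 0 \\ 0 & C_2 & 0 \end{bmatrix}, \quad M_3 = \begin{bmatrix} C_1 & 0 & 0 \\ 0 & C_2 - d & d \end{bmatrix}, \quad M_4 = \begin{bmatrix} C_1 - d & 0 & d \\ 0 & C_2 & 0 \end{bmatrix}, \tag{A2}$$

the following relations hold:

$$NI_2(M_1) < NI_2(M_2) \quad \text{and} \quad NI_2(M_3) < NI_2(M_4), \tag{A3-a}$$

$$NI_2(M_1) < NI_2(M_3) \quad \text{and} \quad NI_2(M_2) < NI_2(M_4), \tag{A3-b}$$

$$\text{where} \quad C_1 > C_2 > d > 0. \tag{A3-c}$$

Proof. For a binary classification, $NI_2$ is defined by the modified mutual information:

$$NI_2 = \frac{I_M(T, Y)}{H(T)}, \quad \text{and}$$

$$I_M(T, Y) = \frac{TN}{n}\log_2\frac{n\,TN}{C_1(TN + FN)} + \frac{FN}{n}\log_2\frac{n\,FN}{C_2(FN + TN)} + \frac{FP}{n}\log_2\frac{n\,FP}{C_1(TP + FP)} + \frac{TP}{n}\log_2\frac{n\,TP}{C_2(FP + TP)}. \tag{A4}$$

Let $M_0$ be a confusion matrix corresponding to the exact classifications:

$$M_0 = \begin{bmatrix} C_1 & 0 & 0 \\ 0 & C_2 & 0 \end{bmatrix}. \tag{A5}$$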

Based on the definition of $I_M$ in (A4), one can calculate the mutual information differences between two models. Considering $M_0$ to be a baseline, we obtain the analytical results below for the four models:

$$\Delta I_{10} = I_M(M_1) - I_M(M_0) = \frac{1}{n}\left(C_1\log_2\frac{C_1}{C_1 + d} + d\log_2\frac{d}{C_1 + d}\right), \tag{A6-a}$$

$$\Delta I_{20} = I_M(M_2) - I_M(M_0) = \frac{1}{n}\left(C_2\log_2\frac{C_2}{C_2 + d} + d\log_2\frac{d}{C_2 + d}\right), \tag{A6-b}$$

$$\Delta I_{30} = I_M(M_3) - I_M(M_0) = \frac{d}{n}\log_2\frac{C_2}{n}, \tag{A6-c}$$

$$\Delta I_{40} = I_M(M_4) - I_M(M_0) = \frac{d}{n}\log_2\frac{C_1}{n}. \tag{A6-d}$$
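
As a worked step, the first of these differences can be obtained by direct substitution. The sketch below assumes the reading of the modified mutual information in which $M_1$ has $TN = C_1$, $FN = d$, $TP = C_2 - d$ and $FP = 0$, and in which zero entries contribute nothing to the sum; the other three differences follow in the same way.

```latex
\begin{align*}
I_M(M_0) &= \frac{C_1}{n}\log_2\frac{n}{C_1} + \frac{C_2}{n}\log_2\frac{n}{C_2} = H(T),\\
I_M(M_1) &= \frac{C_1}{n}\log_2\frac{n}{C_1+d}
          + \frac{d}{n}\log_2\frac{n\,d}{C_2\,(C_1+d)}
          + \frac{C_2-d}{n}\log_2\frac{n}{C_2},\\
I_M(M_1) - I_M(M_0) &= \frac{C_1}{n}\log_2\frac{C_1}{C_1+d}
          + \frac{d}{n}\left[\log_2\frac{n\,d}{C_2\,(C_1+d)} - \log_2\frac{n}{C_2}\right]\\
         &= \frac{1}{n}\left(C_1\log_2\frac{C_1}{C_1+d} + d\log_2\frac{d}{C_1+d}\right).
\end{align*}
```

Both logarithms in the final line have arguments smaller than one, so the difference is negative for any $d > 0$.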

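For the given assumption $C_1 > C_2 > d > 0$, all $\Delta I$s above are negative, so that their absolute values represent the absolute costs in classifications. One can directly prove that $|\Delta I_{30}| > |\Delta I_{40}|$ from (A6-c) and (A6-d). The procedure for the proof of $|\Delta I_{10}| > |\Delta I_{20}|$ is given below. First, one needs to confirm that the following two functions are strictly decreasing ($x_1 < x_2 \Rightarrow g(x_1) > g(x_2)$):

$$g_1(x) = \left(\frac{x}{x + d}\right)^{x} \quad \text{and} \quad g_2(x) = \left(\frac{d}{x + d}\right)^{d} \quad \text{for } x > 0,\ d > 0. \tag{A7-a}$$

Then, from the monotonically decreasing property of (A7-a), one can derive the following relations:

$$\begin{aligned}
C_1 > C_2 &\;\Rightarrow\; \left(\frac{C_1}{C_1 + d}\right)^{C_1} < \left(\frac{C_2}{C_2 + d}\right)^{C_2} < 1 \ \text{ and } \ \left(\frac{d}{C_1 + d}\right)^{d} < \left(\frac{d}{C_2 + d}\right)^{d} < 1 \\
&\;\Rightarrow\; \frac{1}{n}\left|C_2\log_2\frac{C_2}{C_2 + d} + d\log_2\frac{d}{C_2 + d}\right| < \frac{1}{n}\left|C_1\log_2\frac{C_1}{C_1 + d} + d\log_2\frac{d}{C_1 + d}\right| \\
&\;\Rightarrow\; |\Delta I_{20}| < |\Delta I_{10}|.
\end{aligned} \tag{A7-b}$$

The relations in (A3-a) are achieved for $NI_2$ because its normalization term, H(T), is a constant for the given $C_1$ and $C_2$. One therefore confirms the satisfaction of Feature 3 on the property within error types and within reject types around the exact classifications, respectively.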

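Next, we prove the relation (A3-b), which states that a misclassification suffers a higher cost than a rejection for the same class. Feature 3 considers this relation a basic property in classifications between error and reject types. The steps of the proof are:

$$\begin{aligned}
C_1 > C_2 &\;\Rightarrow\; C_1 C_2 + C_1 d > (C_1 + C_2)\,d = n d \\
&\;\Rightarrow\; 1 > \frac{C_1}{n} > \frac{d}{C_2 + d} \;\Rightarrow\; \left|\log_2\frac{C_1}{n}\right| < \left|\log_2\frac{d}{C_2 + d}\right| \\
&\;\Rightarrow\; \frac{1}{n}\left|d\log_2\frac{C_1}{n}\right| < \frac{1}{n}\left|C_2\log_2\frac{C_2}{C_2 + d} + d\log_2\frac{d}{C_2 + d}\right| \\
&\;\Rightarrow\; |\Delta I_{40}| < |\Delta I_{20}|,
\end{aligned} \tag{A8-a}$$

$$\begin{aligned}
C_1 + d < n &\;\Rightarrow\; C_1(C_1 + d) + n d < C_1 n + n d \;\Rightarrow\; \frac{C_1(C_1 + d) + n d}{n\,(C_1 + d)} < 1 \\
&\;\Rightarrow\; \frac{C_1}{n} + \frac{d}{C_1 + d} < 1 \;\Rightarrow\; \frac{d}{C_1 + d} < \frac{C_2}{n} < 1 \;\Rightarrow\; \left|\log_2\frac{C_2}{n}\right| < \left|\log_2\frac{d}{C_1 + d}\right| \\
&\;\Rightarrow\; \frac{1}{n}\left|d\log_2\frac{C_2}{n}\right| < \frac{1}{n}\left|C_1\log_2\frac{C_1}{C_1 + d} + d\log_2\frac{d}{C_1 + d}\right| \\
&\;\Rightarrow\; |\Delta I_{30}| < |\Delta I_{10}|.
\end{aligned} \tag{A8-b}$$

♦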

Theorem A2: For the given conditions (A1)-(A2) and $C_1 > C_2 > d > 0$, $NI_2$ will satisfy the following relations:

$$NI_2(M_4) > NI_2(M_3) > NI_2(M_2) > NI_2(M_1) \quad \text{for } 0.5 < p_1 < Q < 1, \tag{A9-a}$$

$$NI_2(M_4) > NI_2(M_2) > NI_2(M_3) > NI_2(M_1) \quad \text{for } 0.5 < Q < p_1 < 1, \tag{A9-b}$$

where we set $p_1 = C_1/n$, and Q is an upper boundary for the validity of (A9-a).

Proof. The first relation describes that the ranking order in (A9-a) is valid only for a certain range of $p_1$. The lower boundary results from the assumption that $C_1$ is the large class. The upper boundary, Q, is determined by the cross-over point between the functions of (A6-b) and (A6-c). For a better understanding of the relations (A9), we present the plots of "$\Delta I$ vs. $p_1$" when n = 100 and d = 1 (Fig. A1).

Figure A1: Plots of "$\Delta I$ vs. $p_1$ (%)" when n = 100 and d = 1. (Black-Solid = $\Delta I_{10}$, Black-Dash = $\Delta I_{20}$, Blue-Solid = $\Delta I_{30}$, Blue-Dash = $\Delta I_{40}$)

For examining the validity range of (A9-a), one needs to calculate the cross-over point by solving the equation below:

$$f = \Delta I_{20} - \Delta I_{30} = \frac{1}{n}\left(C_2\log_2\frac{C_2}{C_2 + d} + d\log_2\frac{n\,d}{C_2\,(C_2 + d)}\right) = 0. \tag{A10}$$

There exists no closed-form solution for Q. Based on the monotonicity of the related functions and the relations in (A3), one is able to confirm the conditions in (A9-a) and (A9-b), respectively. Fig. A1 shows numerically that only a single cross-over point appears in the range of $p_1 > 0.5$ (or $C_1 > C_2$).

Remark A1: We can denote Q(n, d) to be the cross-over point obtained from f, with two independent variables n and d. The value of Q increases with n, but decreases with d. A numerical solution for Q is required. The physical interpretation of Q is a critical point at which a rejection within a small class has the same cost as a misclassification within a large class. This situation generally does not occur except for classifications of largely-skewed classes (say, $C_1 \gg C_2$). Therefore, we call the ranking order in (A9-a) a general ranking order, and the one in (A9-b) a largely-skewed-class ranking order.
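
Since the cross-over point has no closed form, it can be located numerically. The following sketch uses plain bisection on the difference between the cost of an error in the large class and the cost of a rejection in the small class, i.e. n times the difference of (A6-b) and (A6-c). The function names are hypothetical, and $C_2$ is treated as a continuous variable, which is an assumption made only for locating the root.

```python
import math

def f(c2, n, d):
    """n * (Delta I_20 - Delta I_30), with C_2 = c2 treated as continuous."""
    delta_i20 = c2 * math.log2(c2 / (c2 + d)) + d * math.log2(d / (c2 + d))
    delta_i30 = d * math.log2(c2 / n)
    return delta_i20 - delta_i30

def crossover_p1(n=100, d=1, tol=1e-12):
    """Bisection for the root of f in C_2 over (d, n/2); returns p_1 = 1 - C_2/n."""
    lo, hi = float(d), n / 2.0
    if not (f(lo, n, d) > 0 > f(hi, n, d)):
        raise ValueError("no sign change on (d, n/2)")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid, n, d) > 0:
            lo = mid      # root lies at a larger C_2
        else:
            hi = mid      # root lies at a smaller C_2
    return 1.0 - (lo + hi) / (2.0 * n)

print(round(crossover_p1(100, 1), 3))
```

Calling `crossover_p1` with other values of n and d gives a quick way to explore how the cross-over point moves with the sample size and the number of affected samples.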

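Sensitivity functions: The sensitivity functions are given in the conventional forms for delivering approximation analysis of $I_M$:

$$\frac{\partial I_M}{\partial TN} = \frac{1}{n}\left(\log_2\frac{n}{C_1} + \log_2\frac{TN}{TN + FN}\right)\mathrm{sgn}(TN), \tag{A11-a}$$

$$\frac{\partial I_M}{\partial TP} = \frac{1}{n}\left(\log_2\frac{n}{C_2} + \log_2\frac{TP}{TP + FP}\right)\mathrm{sgn}(TP), \tag{A11-b}$$

$$\frac{\partial I_M}{\partial FN} = \frac{1}{n}\left(\log_2\frac{n}{C_2} + \log_2\frac{FN}{FN + TN}\right)\mathrm{sgn}(FN), \tag{A11-c}$$

$$\frac{\partial I_M}{\partial FP} = \frac{1}{n}\left(\log_2\frac{n}{C_1} + \log_2\frac{FP}{FP + TP}\right)\mathrm{sgn}(FP), \tag{A11-d}$$

$$\frac{\partial I_M}{\partial RN} = -\frac{\partial I}{\partial TN} - \frac{\partial I}{\partial FP}, \tag{A11-e}$$

$$\frac{\partial I_M}{\partial RP} = -\frac{\partial I}{\partial FN} - \frac{\partial I}{\partial TP}, \tag{A11-f}$$

where sgn(·) is a sign function for satisfying the definition of H(0) = 0. Only four independent variables describe the sensitivity functions due to the two constraints in (A1-b). Hence, a chain rule is applied for deriving the functions of (A11-e) and (A11-f).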

Remark A2: Using Eq. (A11), we failed to reach conclusions as reasonable as those in Theorem A1, because the first-order differentials may not be sufficient for the analysis around the exact classifications. For example, we obtained the results:

--^Iog2(^)+^log2(^) = 0. ^ "^ 

I(M2) - /(Mo) ^(TNi- TNo) ^^ +(FPi- FPo) '-^ r A 1 9 - h^ 

This observation suggests that one needs to be cautious when using sensitivity functions for approximation analysis on $I_M$ (or $NI_2$).

Table 1: NI measures within the mutual-information based group.

No.  Name [Reference]                        Formula on $NI_k$
1    NI based on mutual information [28]     $NI_1(T,Y) = \frac{I(T,Y)}{H(T)}$
2    NI based on mutual information [11]     $NI_2(T,Y) = \frac{I_M(T,Y)}{H(T)}$
3    NI based on mutual information [28]     $NI_3(T,Y) = \frac{I(T,Y)}{H(Y)}$
4    NI based on mutual information          $NI_4(T,Y) = \frac{1}{2}\left[\frac{I(T,Y)}{H(T)} + \frac{I(T,Y)}{H(Y)}\right]$
5    NI based on mutual information [26]     $NI_5(T,Y) = \frac{2\,I(T,Y)}{H(T) + H(Y)}$
6    NI based on mutual information [40]     $NI_6(T,Y) = \frac{I(T,Y)}{\sqrt{H(T)\,H(Y)}}$
7    NI based on mutual information [41]     $NI_7(T,Y) = \frac{I(T,Y)}{H(T,Y)}$
8    NI based on mutual information [26]     $NI_8(T,Y) = \frac{I(T,Y)}{\max(H(T), H(Y))}$
9    NI based on mutual information [26]     $NI_9(T,Y) = \frac{I(T,Y)}{\min(H(T), H(Y))}$

Table 2: Information measures within the divergence based group.

No.  Name of $D_k$ [Reference]           Formula on $D_k$ ($NI_k = \exp(-D_k)$)
10   ED-Quadratic Divergence [9]         $D_{10} = QD_{ED}(T,Y) = \sum_z \left(p_t(z) - p_y(z)\right)^2$
11   CS-Quadratic Divergence [9]         $D_{11} = QD_{CS}(T,Y) = \log_2 \frac{\sum_z p_t(z)^2 \sum_z p_y(z)^2}{\left[\sum_z p_t(z)\,p_y(z)\right]^2}$
12   KL Divergence [42]                  $D_{12} = KL(T,Y) = \sum_z p_t(z)\log_2\frac{p_t(z)}{p_y(z)}$
13   Bhattacharyya Distance [43]         $D_{13} = D_B(T,Y) = -\log_2 \sum_z \sqrt{p_t(z)\,p_y(z)}$
14   $\chi^2$ (Pearson) Divergence [44]  $D_{14} = \chi^2(T,Y) = \sum_z \frac{\left(p_t(z) - p_y(z)\right)^2}{p_y(z)}$
15   Hellinger Distance [44]             $D_{15} = H^2(T,Y) = \sum_z \left(\sqrt{p_t(z)} - \sqrt{p_y(z)}\right)^2$
16   Variation Distance [44]             $D_{16} = V(T,Y) = \sum_z \left|p_t(z) - p_y(z)\right|$
17   J divergence [45]                   $D_{17} = J(T,Y) = \sum_z p_t(z)\log_2\frac{p_t(z)}{p_y(z)} + \sum_z p_y(z)\log_2\frac{p_y(z)}{p_t(z)}$
18   L (or JS) divergence [45]           $D_{18} = L(T,Y) = KL(T,M) + KL(Y,M), \quad M = \frac{p_t(z) + p_y(z)}{2}$
19   Symmetric $\chi^2$ Divergence [46]  $D_{19} = \chi^2_S(T,Y) = \sum_z \frac{\left(p_t(z) - p_y(z)\right)^2}{p_y(z)} + \sum_z \frac{\left(p_y(z) - p_t(z)\right)^2}{p_t(z)}$
20   Resistor Average Distance [43]      $D_{20} = D_{RA}(T,Y) = \frac{KL(T,Y)\,KL(Y,T)}{KL(T,Y) + KL(Y,T)}$

Table 3: NI measures within the cross-entropy based group.

No.  Name                        Formula on $NI_k$
21   NI based on cross-entropy   $NI_{21} = \frac{H(T)}{H(T;Y)} = \frac{H(T)}{-\sum_z p_t(z)\log_2 p_y(z)}$
22   NI based on cross-entropy   $NI_{22} = \frac{H(Y)}{H(Y;T)} = \frac{H(Y)}{-\sum_z p_y(z)\log_2 p_t(z)}$
23   NI based on cross-entropy   $NI_{23} = \frac{1}{2}\left(\frac{H(T)}{H(T;Y)} + \frac{H(Y)}{H(Y;T)}\right)$
24   NI based on cross-entropy   $NI_{24} = \frac{H(T) + H(Y)}{H(T;Y) + H(Y;T)}$



Table 4: Numerical examples in binary classifications (M1-M4 and M6: $C_1 = 90$, $C_2 = 10$; M5: $C_1 = 95$, $C_2 = 5$). (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model       M1          M2          M3          M4          M5          M6
(Ranking)   (C)         (B)         (B)         (A)
C           90  0  0    89  1  0    90  0  0    89  0  1    57 38  0    89  1  0
             1  9  0     0 10  0     0  9  1     0 10  0     3  2  0     1  9  0
CR          0.990       0.990       0.990       0.990       0.590       0.980
Rej         0.000       0.000       0.010       0.010       0.000       0.000
Precision   0.989       1.000       1.000       1.000       0.950       0.989
Recall      1.000       0.989       1.000       1.000       0.600       0.989
F1          0.994       0.994       1.000       1.000       0.735       0.989

Table 5: Results for the models in Table 4 on information measures from mutual-information and cross-entropy groups. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model    NI1        NI2        NI3        NI4        NI5        NI6        NI7        NI8        NI9        NI21       NI22       NI23       NI24
M1 (C)   0.831 (D)  0.831 (D)  0.893 (B)  0.862 (D)  0.860 (D)  0.861 (D)  0.755 (D)  0.831 (D)  0.893 (D)  0.998 (A)  0.998 (A)  0.998 (A)  0.998 (A)
M2 (B)   0.897 (C)  0.897 (C)  0.841 (D)  0.869 (C)  0.868 (C)  0.869 (C)  0.767 (C)  0.841 (C)  0.897 (C)  0.998 (A)  0.998 (A)  0.998 (A)  0.998 (A)
M3 (B)   1.000 (A)  0.929 (B)  0.909 (A)  0.955 (A)  0.952 (A)  0.953 (A)  0.909 (A)  0.909 (A)  1.000 (A)  0.969 (D)  0.000 (B)  0.484 (C)  0.000 (B)
M4 (A)   1.000 (A)  0.997 (A)  0.855 (C)  0.928 (B)  0.922 (B)  0.925 (B)  0.855 (B)  0.855 (B)  1.000 (A)  0.970 (C)  0.000 (B)  0.485 (B)  0.000 (B)
M5       0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.000      0.374      0.548      0.461      0.495
M6       0.731      0.731      0.731      0.731      0.731      0.731      0.576      0.731      0.731      1.000      1.000      1.000      1.000


Table 6: Results for the models in Table 4 on information measures from divergence group. S = singularity which cannot be removed. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model    NI10        NI11        NI12        NI13        NI14        NI15        NI16        NI17        NI18        NI19        NI20
M1 (C)   0.9998 (A)  0.9998 (A)  0.9991 (B)  0.9998 (A)  0.9988 (B)  0.9997 (A)  0.9802 (A)  0.9983 (B)  0.9996 (A)  0.9977 (B)  0.9996 (A)
M2 (B)   0.9998 (A)  0.9998 (A)  0.9992 (A)  0.9998 (A)  0.9990 (A)  0.9997 (A)  0.9802 (A)  0.9985 (A)  0.9996 (A)  0.9979 (A)  0.9996 (A)
M3 (B)   0.9998 (A)  0.9996 (D)  0.9849 (D)  0.9926 (D)  0.9890 (D)  0.9898 (D)  0.9802 (A)  S           0.9897 (D)  S           S
M4 (A)   0.9998 (A)  0.9998 (A)  0.9856 (C)  0.9928 (C)  0.9899 (C)  0.9900 (C)  0.9802 (A)  S           0.9900 (C)  S           S
M5       0.7827      0.6473      0.6189      0.8540      0.6002      0.8129      0.4966      0.2775      0.7550      0.0455      0.7406
M6       1.0000      1.0000      1.0000      1.0000      1.0000      1.0000      1.0000      1.0000      1.0000      1.0000      S


Table 7: Numerical examples in binary classifications (n = 100). (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model         M1a        M2a        M3a        M4a        M1b        M2b        M3b        M4b
C             94  0  0   93  1  0   94  0  0   93  0  1   95  0  0   94  1  0   95  0  0   94  0  1
               1  5  0    0  6  0    0  5  1    0  6  0    1  4  0    0  5  0    0  4  1    0  5  0
CR            0.99       0.99       0.99       0.99       0.99       0.99       0.99       0.99
(Rejection)   (0.00)     (0.00)     (0.01)     (0.01)     (0.00)     (0.00)     (0.01)     (0.01)
NI2           0.756      0.874      0.876      0.997      0.720      0.864      0.849      0.997
(Ranking)     (D)        (C)        (B)        (A)        (D)        (B)        (C)        (A)


Table 8: Classification examples in three classes ($C_1 = 80$, $C_2 = 15$, $C_3 = 5$). (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model       M7            M8            M9            M10           M11
(Ranking)   (C)           (C)           (B)           (B)           (B)
C           80  0  0  0   80  0  0  0   80  0  0  0   80  0  0  0   80  0  0  0
             0 15  0  0    0 15  0  0    0 15  0  0    1 14  0  0    0 14  1  0
             1  0  4  0    0  1  4  0    0  0  4  1    0  0  5  0    0  0  5  0
CR          0.99          0.99          0.99          0.99          0.99
Rej         0.00          0.00          0.01          0.00          0.00

Model       M12           M13           M14           M15
(Ranking)   (B)           (B)           (B)           (A)
C           80  0  0  0   79  1  0  0   79  0  1  0   79  0  0  1
             0 14  0  1    0 15  0  0    0 15  0  0    0 15  0  0
             0  0  5  0    0  0  5  0    0  0  5  0    0  0  5  0
CR          0.99          0.99          0.99          0.99
Rej         0.01          0.00          0.00          0.01


Table 9: Results for the models in Table 8 on information measures from the mutual-information and cross-entropy groups. S = singularity which cannot be removed. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model   NI_1    NI_2    NI_3    NI_4    NI_5    NI_6    NI_7    NI_8    NI_9    NI_21   NI_22   NI_23   NI_24
M7      0.912   0.912   0.957   0.935   0.934   0.934   0.876   0.912   0.957   0.998   0.998   0.998   0.998
(F)     (F)     (F)     (C)     (G)     (G)     (G)     (F)     (H)     (E)     (D)     (C)     (C)     (C)
M8      0.939   0.939   0.958   0.949   0.949   0.949   0.902   0.939   0.958   0.998   0.998   0.998   0.998
(F)     (E)     (E)     (B)     (D)     (D)     (D)     (D)     (D)     (D)     (D)     (C)     (C)     (C)
M9      1.000   0.951   0.961   0.980   0.980   0.980   0.961   0.961   1.000   0.982   0.000   0.491   0.000
(C)     (A)     (D)     (A)     (A)     (A)     (A)     (A)     (A)     (A)     (G)     (G)     (I)     (G)
M10     0.912   0.912   0.938   0.925   0.925   0.925   0.860   0.912   0.938   0.999   0.999   0.999   0.999
(E)     (F)     (F)     (F)     (I)     (I)     (I)     (H)     (H)     (G)     (A)     (A)     (A)     (A)
M11     0.956   0.956   0.941   0.948   0.948   0.948   0.902   0.941   0.956   0.998   0.998   0.998   0.998
(E)     (D)     (C)     (E)     (E)     (E)     (E)     (D)     (C)     (E)     (B)     (C)     (C)     (C)
M12     1.000   0.969   0.943   0.972   0.971   0.971   0.943   0.943   1.000   0.983   0.000   0.492   0.000
(B)     (A)     (B)     (D)     (B)     (B)     (B)     (B)     (B)     (A)     (F)     (G)     (G)     (G)
M13     0.939   0.939   0.915   0.927   0.927   0.927   0.863   0.915   0.939   0.999   0.999   0.999   0.999
(D)     (E)     (E)     (I)     (H)     (H)     (H)     (G)     (G)     (F)     (A)     (A)     (A)     (A)
M14     0.956   0.956   0.916   0.936   0.935   0.936   0.879   0.916   0.956   0.998   0.998   0.998   0.998
(D)     (D)     (C)     (H)     (F)     (F)     (F)     (E)     (F)     (E)     (D)     (C)     (C)     (C)
M15     1.000   0.996   0.919   0.960   0.958   0.959   0.919   0.919   1.000   0.984   0.000   0.492   0.000
(A)     (A)     (A)     (G)     (C)     (C)     (C)     (C)     (E)     (A)     (E)     (G)     (G)     (G)
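The NI_1 column of Table 9 illustrates why a rejection is not penalized when it is treated as an ordinary output symbol. The sketch below is an illustration only: it assumes NI_1 = I(T;Y)/H(T), with the reject column counted as one more value of Y, a reading that agrees with the tabulated values to three decimals. The function names are hypothetical, and the matrices are those read from Table 8.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def ni1(cm):
    """I(T;Y) / H(T), with the reject column treated as an extra output."""
    n = sum(sum(row) for row in cm)
    joint = [c / n for row in cm for c in row]
    p_t = [sum(row) / n for row in cm]          # true-class marginals
    p_y = [sum(col) / n for col in zip(*cm)]    # output marginals, reject included
    mutual = entropy(p_t) + entropy(p_y) - entropy(joint)
    return mutual / entropy(p_t)

M7  = [[80, 0, 0, 0], [0, 15, 0, 0], [1, 0, 4, 0]]   # one error in class 3
M9  = [[80, 0, 0, 0], [0, 15, 0, 0], [0, 0, 4, 1]]   # one rejection in class 3
M11 = [[80, 0, 0, 0], [0, 14, 1, 0], [0, 0, 5, 0]]   # one error in class 2
M13 = [[79, 1, 0, 0], [0, 15, 0, 0], [0, 0, 5, 0]]   # one error in class 1

for name, cm in [("M7", M7), ("M9", M9), ("M11", M11), ("M13", M13)]:
    print(name, f"{ni1(cm):.3f}")
```

For M9 every output value identifies a single true class, so I(T;Y) equals H(T) and the ratio reaches 1.000 even though one pattern was rejected. The same holds for M12 and M15, which is why NI_1 ranks all three rejection models as (A).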



Table 10: Results for the models in Table 8 on information measures from the divergence group. S = singularity which cannot be removed. (R) = ranking order for the model, where R = A, B, ..., in descending order from the top.

Model    NI_10    NI_11    NI_12    NI_13    NI_14    NI_15    NI_16    NI_17    NI_18    NI_19    NI_20
M7       0.9998   0.9998   0.9982   0.9996   0.9974   0.9994   0.9802   0.9966   0.9992   0.9953   0.9992
(F)      (A)      (A)      (D)      (C)      (E)      (D)      (A)      (D)      (D)      (E)      (D)
M8       0.9998   0.9996   0.9979   0.9995   0.9969   0.9993   0.9802   0.9959   0.9990   0.9942   0.9990
(F)      (A)      (E)      (E)      (D)      (F)      (E)      (A)      (F)      (F)      (F)      (F)
M9       0.9998   0.9996   0.9840   0.9924   0.9876   0.9895   0.9802   S        0.9893   S        S
(C)      (A)      (E)      (H)      (G)      (I)      (H)      (A)               (H)
M10      0.9998   0.9997   0.9994   0.9999   0.9992   0.9998   0.9802   0.9988   0.9997   0.9984   0.9997
(E)      (A)      (C)      (A)      (A)      (A)      (A)      (A)      (B)      (A)      (C)      (A)
M11      0.9998   0.9996   0.9982   0.9995   0.9976   0.9994   0.9802   0.9964   0.9991   0.9950   0.9991
(E)      (A)      (E)      (D)      (D)      (D)      (D)      (A)      (E)      (E)      (F)      (E)
M12      0.9998   0.9996   0.9852   0.9927   0.9893   0.9899   0.9802   S        0.9898   S        S
(B)      (A)      (E)      (G)      (F)      (H)      (G)      (A)               (H)
M13      0.9998   0.9997   0.9994   0.9999   0.9992   0.9998   0.9802   0.9989   0.9997   0.9985   0.9997
(D)      (A)      (C)      (A)      (A)      (A)      (A)      (A)      (A)      (A)      (A)      (A)
M14      0.9998   0.9997   0.9986   0.9996   0.9982   0.9995   0.9802   0.9972   0.9993   0.9961   0.9993
(D)      (A)      (C)      (C)      (C)      (C)      (C)      (A)      (C)      (C)      (D)      (C)
M15      0.9998   0.9998   0.9856   0.9928   0.9899   0.9900   0.9802   S        0.9900   S        S
(A)      (A)      (A)      (F)      (E)      (G)      (F)      (A)               (G)







